Netflix Movies Investigation

Code
import pandas as pd
import matplotlib.pyplot as plt
from itables import init_notebook_mode


Netflix! What started in 1997 as a DVD rental service has since exploded into one of the largest entertainment and media companies.

Given the large number of movies and series available on the platform, it’s a perfect opportunity for me to flex my exploratory data analysis skills and dive into the entertainment industry. I’ve also been brushing up on my Python skills and have taken an initial look at a CSV file containing Netflix data. My hypothesis is that the average duration of movies has been declining. Building on that initial research, I’ll dig into the Netflix data to determine whether movie lengths are actually getting shorter and to explain some of the contributing factors, if any.

The dataset, netflix_data.csv, contains the following columns:

The data

netflix_data.csv

Column        Description
show_id       The ID of the show
type          Type of show
title         Title of the show
director      Director of the show
cast          Cast of the show
country       Country of origin
date_added    Date added to Netflix
release_year  Year of Netflix release
duration      Duration of the show in minutes
description   Description of the show
genre         Show genre

Data Overview

Code
init_notebook_mode(all_interactive=True)
Code
netflix_df = pd.read_csv("netflix_data.csv")
Code
netflix_df.head(5)
Code
netflix_df.tail(5)
Code
netflix_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   duration      7787 non-null   int64 
 9   description   7787 non-null   object
 10  genre         7787 non-null   object
dtypes: int64(2), object(9)
memory usage: 669.3+ KB
Code
netflix_df.describe()
Code
netflix_df.shape
(7787, 11)
Code
netflix_df['release_year'].agg(['min', 'max'])
Code
# number of missing values per column, showing only columns that have any
data_null = netflix_df.isna().sum()
data_null[data_null > 0]
Code
# percentage of missing values per column, rounded to two decimals
null_percent = (netflix_df.isna().mean() * 100).round(2)
null_percent[null_percent > 0]

director: 30.68% of entries are missing. This may be because TV shows and movies can have more than one director, or because multiple directors are not recorded clearly in this dataset.
cast: 9.22% of entries are missing. This could be because some shows and movies do not list all cast members, or because the cast plays only a limited role.
country: 6.51% of entries are missing, possibly because some TV shows and movies are produced across several countries, which makes assigning a single country difficult.
date_added: 0.13% of entries are missing. Dates might be missing if the data was scraped before a title’s official release, or if the entry is older content.
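The explanations above are guesses, and one way to probe them is to check whether the gaps cluster in one type of title. The sketch below is a minimal illustration: `missing_share_by_type` is a hypothetical helper (not part of the original notebook), and the toy rows are invented purely to show the output shape. Applied to `netflix_df`, it would show whether missing directors are mostly a TV show issue.

```python
import pandas as pd


def missing_share_by_type(df: pd.DataFrame, column: str) -> pd.Series:
    """Percentage of missing values in `column`, split by the `type` column."""
    # isna() gives True/False per row; the mean of booleans is the missing share
    return df[column].isna().groupby(df["type"]).mean().mul(100).round(2)


# Toy rows invented for illustration only; they are NOT from netflix_data.csv
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "director": ["A. Director", "B. Director", None, "C. Director"],
})

result = missing_share_by_type(toy, "director")
print(result)
```

With the real data, the call would be `missing_share_by_type(netflix_df, "director")`, and the same helper works for "cast", "country", or "date_added".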

Code
# count TV shows (serials) and movies
serials_count = (netflix_df["type"] == "TV Show").sum()
movies_count = (netflix_df["type"] == "Movie").sum()

print(f"Serials: {serials_count}, Movies: {movies_count}")
Serials: 2410, Movies: 5377
Code
serials_percent = serials_count / (serials_count + movies_count) * 100
movies_percent = movies_count / (serials_count + movies_count) * 100

print(f"Serials: {serials_percent:.2f}%, Movies: {movies_percent:.2f}%")
Serials: 30.95%, Movies: 69.05%

Data breakdown

Total TV Shows: 2,410
Total Movies: 5,377
Total Entries: 7,787

Percentage:

TV Shows: 30.95%
Movies: 69.05%


This indicates that nearly 70% of Netflix content consists of movies, while TV shows account for about 30%.
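The same breakdown can be produced in one step with `value_counts`, which avoids hard-coding the category names. This is a sketch under assumptions: `type_breakdown` is a hypothetical helper name, and the toy frame is invented for illustration rather than taken from the dataset.

```python
import pandas as pd


def type_breakdown(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of each value in the `type` column."""
    counts = df["type"].value_counts()
    # normalize=True returns fractions that sum to 1, so scale to percent
    percent = df["type"].value_counts(normalize=True).mul(100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})


# Toy rows invented for illustration only; they are NOT from netflix_data.csv
toy = pd.DataFrame({"type": ["Movie", "Movie", "Movie", "TV Show"]})

breakdown = type_breakdown(toy)
print(breakdown)
```

Calling `type_breakdown(netflix_df)` should reproduce the counts and percentages listed above.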

Interpretation


Netflix’s catalog appears to be movie-heavy, with more than twice as many movies as TV shows (5,377 versus 2,410). This suggests that Netflix may prioritize standalone films over long-running TV shows, although title counts alone cannot confirm it. Data on Netflix user preferences would make the picture clearer.

Code
# keep only movies (drop TV shows)
netflix_subset = netflix_df[netflix_df["type"] == "Movie"]
# select only the columns of interest for the movie analysis
netflix_movies = netflix_subset[["title", "country", "genre", "release_year", "duration"]]
# movies shorter than 60 minutes, kept for separate inspection (not used in the plot below)
short_movies = netflix_movies[netflix_movies["duration"] < 60]

# Define genre-to-color mapping
color_map = {
    "Children": "red",
    "Documentaries": "green",
    "Stand-Up": "yellow"
}

# Assign colors based on genre using .map()
colors = netflix_movies["genre"].map(color_map).fillna("black") 

# Plot
plt.style.use('default')
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(netflix_movies.release_year, netflix_movies.duration, c=colors, alpha=0.7)

# Labels and title
ax.set_title("Movie Duration by Year of Release")
ax.set_xlabel("Release Year")
ax.set_ylabel("Duration (min)")

plt.show()
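The plot colors three genres, and the `short_movies` filter is one way to see which genres dominate the titles under 60 minutes. The sketch below is illustrative only: `short_movie_genres` is a hypothetical helper, and the toy rows are invented, so the counts say nothing about the real dataset.

```python
import pandas as pd


def short_movie_genres(movies: pd.DataFrame, max_minutes: int = 60) -> pd.Series:
    """Count genres among movies strictly shorter than `max_minutes`."""
    short = movies[movies["duration"] < max_minutes]
    return short["genre"].value_counts()


# Toy rows invented for illustration only; they are NOT from netflix_data.csv
toy_movies = pd.DataFrame({
    "genre": ["Documentaries", "Dramas", "Children", "Documentaries"],
    "duration": [45, 110, 30, 50],
})

genre_counts = short_movie_genres(toy_movies)
print(genre_counts)
```

On the real data, `short_movie_genres(netflix_movies)` would show whether the three colored genres actually account for most of the short titles.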

Are we certain that movies are getting shorter?

answer = "maybe"
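A scatter plot alone makes the trend hard to judge, so one possible next step is to compare the mean duration per release year with and without the three genres highlighted in the plot. This is a sketch, not the notebook's method: `mean_duration_by_year` and `SHORT_FORM_GENRES` are hypothetical names, and the toy rows are invented for illustration.

```python
import pandas as pd

# Assumption: these are the genres colored in the scatter plot above
SHORT_FORM_GENRES = ["Children", "Documentaries", "Stand-Up"]


def mean_duration_by_year(movies: pd.DataFrame, exclude_genres=()) -> pd.Series:
    """Mean movie duration per release year, optionally excluding some genres."""
    kept = movies[~movies["genre"].isin(list(exclude_genres))]
    return kept.groupby("release_year")["duration"].mean()


# Toy rows invented for illustration only; they are NOT from netflix_data.csv
toy = pd.DataFrame({
    "genre": ["Dramas", "Dramas", "Documentaries", "Stand-Up"],
    "release_year": [2010, 2020, 2020, 2020],
    "duration": [120, 110, 50, 60],
})

all_movies = mean_duration_by_year(toy)
features_only = mean_duration_by_year(toy, SHORT_FORM_GENRES)
print(all_movies)
print(features_only)
```

If the yearly mean falls for all movies but stays roughly flat once those genres are excluded, the apparent decline would be driven by a growing share of short-form genres rather than by shorter feature films. Years with very few titles would need care, since their means are noisy.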
